Text mining in an example

  • Garry works at Bol.com (a webshop in the Netherlands)

  • He works in the Customer Relationship Management department.

  • He reads customers’ reviews (comments), extracts the aspects each review discusses, and identifies the sentiment toward each aspect.

  • Curious about his job? See two examples!

This is a nice book for both young and old. It gives beautiful life lessons in a fun way. Definitely worth the money!

+ Educational

+ Funny

+ Price


Nice story for older children.

+ Funny

- Readability

Example

  • Garry likes his job a lot, but sometimes it is frustrating!

  • This is mainly because the company is expanding quickly!

  • Garry decides to hire Larry as his assistant.

Example

  • Still, a lot to do for two people!

  • Garry has some budget left to hire another assistant for a couple of years!

  • He decides to hire Harry too!

  • Still, manual labeling is labor-intensive!

Challenges?

  • What are the challenges Garry, Larry, and Harry encounter in doing their job, when working with text data?

Challenges with text data

  • Huge amount of data

  • High dimensional but sparse

    • all possible word and phrase types in the language!!
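
To make "high dimensional but sparse" concrete, here is a minimal Python sketch (used only for illustration; the tokenizer and variable names are assumptions, not part of the course material). It builds a bag-of-words count matrix from the opening sentences of the two reviews shown earlier and measures the share of zero cells.

```python
import re

# First sentence of each review from the earlier example.
reviews = [
    "This is a nice book for both young and old.",
    "Nice story for older children.",
]

def tokenize(text):
    # Lowercase and keep runs of letters only (a deliberately crude tokenizer).
    return re.findall(r"[a-z]+", text.lower())

tokens = [tokenize(review) for review in reviews]
vocabulary = sorted({word for doc in tokens for word in doc})

# One row per review, one column per vocabulary word.
matrix = [[doc.count(word) for word in vocabulary] for doc in tokens]

zeros = sum(row.count(0) for row in matrix)
sparsity = zeros / (len(matrix) * len(vocabulary))
print(f"{len(vocabulary)} columns, {sparsity:.0%} of cells are zero")
```

Every new word adds a column that is zero for almost all other documents, so the share of empty cells typically grows as the collection grows.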

  • Ambiguity

  • Noisy data

    • Examples: Abbreviations, spelling errors, short text
  • Complex relationships between words

    • “Hema merges with Intertoys”

    • “Intertoys is bought by Hema”

Back to the story

Example

  • During one of the coffee moments at the company, Garry was talking about the situation in the Customer Relationship Management department.

  • When Carrie, his colleague from the IT department, hears about the situation, she suggests that Garry use text mining!

  • She says: “Text mining is your friend; it can help you make the process faster by filtering and recommending possible words…”

  • She continues: “Text mining is a subfield of AI and NLP and is related to data science, data mining, and machine learning. It will make the process faster and cut some of the expenses!”

  • After consulting with Larry and Harry, they decide to give text mining a try!

Text mining definition?

  • Which of the following can be part of a definition of text mining?
    • The discovery by computer of new, previously unknown information from textual data
    • Automatically extracting information from text
    • Text mining is about looking for patterns in text
    • Text mining describes a set of techniques that model and structure the information content of textual sources


(You can choose more than one option.)

Go to www.menti.com and use the code 22 07 62 0

Text mining definition

  • “the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources” Hearst (1999)

  • Text mining is about looking for patterns in text, in a similar way that data mining can be loosely described as looking for patterns in data.

  • Text mining describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources. (Wikipedia)

Logistics

Access

Lecturers and Assistants

José de Kruif

Dong Nguyen

Program

Time             Monday         Tuesday        Wednesday
9:00 – 10:30     Lecture 1      Lecture 3      Lecture 5
                 Break          Break          Break
10:45 – 11:45    Practical 1    Practical 3    Practical 5
11:45 – 12:30    Discussion 1   Discussion 3   Discussion 5
12:30 – 14:00    Lunch          Lunch          Lunch
14:00 – 15:30    Lecture 2      Lecture 4      Lecture 6
                 Break          Break          Break
15:45 – 16:30    Practical 2    Practical 4    Practical 6
16:30 – 17:00    Discussion 2   Discussion 4   Discussion 6

Goal of the course

  • The course teaches students text mining techniques using R on a variety of applications in many domains of science.

Do you have any questions?

  • During the lecture

    • Post your questions in the chat; we will read them during a break.
  • During the computer lab

    • Post your questions in general; we will answer them.
  • After the lecture

    • Feel free to send me an email or message me on Microsoft Teams

Introduction

Ayoub Bagheri, a.bagheri@uu.nl

Utrecht Summer School: Applied Text Mining

Another TM definition

Text mining process

Text Classification

Sentiment/Opinion Analysis

Statistical machine translation

Dialog Systems

Question answering

Go beyond search

And more …

  • Automatically classify political news from sports news

  • Authorship identification

  • Age/gender identification

  • Language Identification

  • Sentiment analysis

Disease Classification

Contribution to ASReview

Text mining tasks

  • Text classification
  • Text clustering


We will also cover:

  • Sentiment analysis
  • Feature selection
  • Topic modelling
  • Word embedding
  • Deep learning models
  • Responsible text mining
  • Text summarization

Text classification

  • Supervised learning
  • Human experts annotate a set of text data
    • Training set
  • Learn a classification model
Document   Class
Email1     Not spam
Email2     Not spam
Email3     Spam
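
The three steps above (annotate, build a training set, learn a model) can be sketched in a few lines of Python. The slides do not name a specific classifier; this sketch assumes a multinomial Naive Bayes model with add-one smoothing, and the training emails are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical annotated training set (texts and labels are invented).
training = [
    ("meeting agenda for monday", "not spam"),
    ("project report attached", "not spam"),
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

# "Learning" here means counting documents per class and words per class.
class_counts = Counter(label for _, label in training)
word_counts = defaultdict(Counter)
for text, label in training:
    word_counts[label].update(tokenize(text))
vocabulary = {word for counts in word_counts.values() for word in counts}

def predict(text):
    scores = {}
    for label in class_counts:
        # Start from the log prior of the class.
        score = math.log(class_counts[label] / len(training))
        total = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one (Laplace) smoothed word likelihood.
            score += math.log((word_counts[label][word] + 1) / (total + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("claim your free money"))
print(predict("agenda for the project meeting"))
```

With only four training emails the model is a toy; the point is that the labels supplied by human annotators are what the model learns from.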

Text classification?

  • Which problem is least likely to be a text classification task?

    • Author’s gender detection from text

    • Determining the smoking status of patients from clinical letters

    • Grouping news articles into political vs non-political news

    • Classifying reviews into positive and negative sentiment


Go to www.menti.com and use the code 86 08 86 5

Text clustering

  • Unsupervised learning
  • Finding Groups of Similar Documents
  • No labeled data
Document        Cluster
News article1   ?
News article2   ?
News article3   ?
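
As a rough illustration of grouping similar documents without labels, the Python sketch below compares documents by word overlap (Jaccard similarity) and places each one in the first group that already contains a similar document. This greedy threshold rule is a simplification chosen for brevity, not a standard clustering algorithm; the headlines and the 0.15 threshold are assumptions.

```python
import re

# Hypothetical headlines, invented for illustration; note there are no labels.
articles = [
    "Parliament votes on the new budget",
    "Striker scores twice in cup final",
    "Budget debate divides parliament",
    "Cup final ends with late goal",
]

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    # Shared words divided by all distinct words in the two documents.
    return len(a & b) / len(a | b)

token_sets = [tokens(article) for article in articles]
clusters = []  # each cluster is a list of article indices
for i, current in enumerate(token_sets):
    for cluster in clusters:
        # Join the first cluster that contains a sufficiently similar article.
        if any(jaccard(current, token_sets[j]) >= 0.15 for j in cluster):
            cluster.append(i)
            break
    else:
        # No similar cluster found: start a new one.
        clusters.append([i])

print(clusters)
```

The groups come out as index lists with no names attached, which is the "?" in the table: interpreting what each cluster means is left to the analyst.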

Text Clustering?

  • Which problem is not a text clustering task? (less likely to be)

    • Grouping similar news articles

    • Grouping discharge letters in two categories: heart disease vs cancer

    • Grouping tweets which support Trump into three undefined subgroups

    • Grouping online books of a library in 10 categories


Go to www.menti.com and use the code 86 08 86 5

Regular expressions

Understanding Regular Expressions

  • Very powerful and quite cryptic

  • Fun once you understand them

  • Regular expressions are a language unto themselves

  • A language of “marker characters” - programming with characters

  • It is kind of an “old school” language - compact

Regular expressions

  • A formal language for specifying text strings

  • How can we search for any of these?

    • woodchuck

    • woodchucks

    • Woodchuck

    • Woodchucks

Regular Expressions: Disjunctions

  • Letters inside square brackets []

  • Ranges [A-Z]
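
A short Python sketch of both ideas, using the standard `re` module (the sample sentence is invented for illustration):

```python
import re

text = "Woodchuck 1 saw woodchuck 2 near Drenthe."

# Square brackets match exactly one of the listed characters.
either_case = re.findall(r"[wW]oodchuck", text)

# A range is shorthand for listing every character between its endpoints.
digits = re.findall(r"[0-9]", text)
capitals = re.findall(r"[A-Z]", text)

print(either_case, digits, capitals)
```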

Regular Expressions: Negation in Disjunction

  • Negations [^Ss]

    • Caret means negation only when it comes first inside []
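
The position rule is easy to check in Python's `re` module. The strings below are invented for illustration:

```python
import re

# Caret first: match any single character that is NOT S or s.
not_s = re.findall(r"[^Ss]", "Sssh!")

# Caret first with a range: anything except an uppercase letter.
not_upper = re.findall(r"[^A-Z]", "Oyfn")

# Caret NOT first: it is an ordinary character, so this matches e or ^.
e_or_caret = re.findall(r"[e^]", "look up ^ here")

print(not_s, not_upper, e_or_caret)
```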

Regular Expressions: More Disjunction

  • Woodchuck is another name for groundhog!
  • The pipe | for disjunction
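
A minimal Python sketch of the pipe, alone and combined with square brackets (the sentence is invented for illustration):

```python
import re

text = "The groundhog, or Woodchuck, is a rodent."

# The pipe matches either the whole pattern on its left or the one on its right.
lowercase_only = re.findall(r"groundhog|woodchuck", text)

# Brackets and the pipe combine to cover both words in either capitalization.
any_case = re.findall(r"[gG]roundhog|[wW]oodchuck", text)

print(lowercase_only, any_case)
```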

Regular Expressions: ? * + .
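
The four operators in this slide title can be tried out in Python's `re` module. The test strings are invented; the last pattern is one possible answer to the earlier woodchuck question.

```python
import re

# ?  makes the preceding character optional.
optional = re.findall(r"colou?r", "color colour colouur")

# *  repeats the preceding character zero or more times.
zero_or_more = re.findall(r"o*h!", "oh! ooh! oooh! h!")

# +  repeats the preceding character one or more times.
one_or_more = re.findall(r"o+h!", "oh! ooh! oooh! h!")

# .  matches any single character (except a newline, by default).
any_char = re.findall(r"beg.n", "begin begun beg3n begn")

# Brackets plus ? cover all four woodchuck variants.
woodchucks = re.findall(r"[wW]oodchucks?", "woodchuck woodchucks Woodchuck Woodchucks")

print(optional, zero_or_more, one_or_more, any_char, woodchucks)
```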

Regular Expressions: Anchors ^ $
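
Anchors match positions rather than characters. A small Python sketch, with invented example lines:

```python
import re

lines = ["The book is nice.", "nice book", "Definitely worth it!"]

# ^ anchors the match to the start of the string.
starts_upper = [line for line in lines if re.search(r"^[A-Z]", line)]

# $ anchors the match to the end; the period is escaped to mean a literal dot.
ends_period = [line for line in lines if re.search(r"\.$", line)]

# Using both anchors forces the pattern to cover the whole string.
exact = [line for line in lines if re.search(r"^nice book$", line)]

print(starts_upper, ends_period, exact)
```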

Example

  • Find me all instances of the word “the” in a text.

    • the

      • Misses capitalized examples

    • [tT]he

      • Incorrectly returns “other” or “theology”

    • [^a-zA-Z][tT]he[^a-zA-Z]
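
The progression can be replayed in Python (patterns written without spaces between their parts; the sentence is invented for illustration). The final `\b` pattern is an addition that does not appear on the slide: it uses word boundaries, which also handle a match at the very start of the text.

```python
import re

text = "The other day the theology student said the end."

# Attempt 1: lowercase only; misses "The" and also hits "other" and "theology".
attempt1 = re.findall(r"the", text)

# Attempt 2: catches "The" but still matches inside "other" and "theology".
attempt2 = re.findall(r"[tT]he", text)

# Attempt 3: requires a non-letter on both sides. The neighbours are part of
# the match, and "The" at the start is missed because nothing precedes it.
attempt3 = re.findall(r"[^a-zA-Z][tT]he[^a-zA-Z]", text)

# Alternative (not on the slide): \b marks a word boundary without consuming
# a character, so the first word is found too.
with_boundaries = re.findall(r"\b[tT]he\b", text)

print(len(attempt1), len(attempt2), attempt3, with_boundaries)
```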

Errors

  • The process we just went through was based on fixing two kinds of errors

    • Matching strings that we should not have matched (there, then, other)

      • False positives (Type I)
    • Not matching things that we should have matched (The)

      • False negatives (Type II)

Errors cont.

  • In NLP we are always dealing with these kinds of errors.

  • Reducing the error rate for an application often involves two antagonistic efforts:

    • Increasing accuracy or precision (minimizing false positives)

    • Increasing coverage or recall (minimizing false negatives).
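
The trade-off can be quantified with a short Python sketch. The sentence is invented, and the "gold" set stands in for human annotation by taking every whole word equal to "the" in any capitalization; both are assumptions made for illustration.

```python
import re

text = "The other day the theology student said the end."

# Stand-in for hand labelling: start offsets of the whole word "the".
gold = {m.start() for m in re.finditer(r"[A-Za-z]+", text)
        if m.group().lower() == "the"}

def evaluate(pattern):
    predicted = {m.start() for m in re.finditer(pattern, text)}
    true_pos = len(predicted & gold)    # matched and should have been
    false_pos = len(predicted - gold)   # matched but should not have been
    false_neg = len(gold - predicted)   # should have matched but did not
    # Assumes at least one prediction and one gold item (no zero division).
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall

print(evaluate(r"the"))
print(evaluate(r"[tT]he"))
```

Moving from `the` to `[tT]he` raises recall because the capitalized instance is now found, while the false positives inside "other" and "theology" remain and keep precision low.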

Regular Expression Quick Guide

^          Matches the beginning of a line
$          Matches the end of a line
.          Matches any character
\s         Matches a whitespace character
\S         Matches any non-whitespace character
*          Repeats a character zero or more times
*?         Repeats a character zero or more times (non-greedy)
+          Repeats a character one or more times
+?         Repeats a character one or more times (non-greedy)
[aeiou]    Matches a single character in the listed set
[^XYZ]     Matches a single character not in the listed set
[a-z0-9]   The set of characters can include a range
(          Indicates where string extraction is to start
)          Indicates where string extraction is to end
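
Two entries in the guide are easier to see in action: greedy versus non-greedy repetition, and parentheses for extraction. This Python sketch uses an invented line with a placeholder address.

```python
import re

# Hypothetical input line; the address is a placeholder.
line = "From: garry@example.com Subject: reviews: week 1"

# Greedy: .+ grabs as much as possible, so the match runs to the LAST colon.
greedy = re.findall(r"F.+:", line)

# Non-greedy: .+? grabs as little as possible, stopping at the FIRST colon.
non_greedy = re.findall(r"F.+?:", line)

# \S+ matches a run of non-whitespace characters on either side of the @.
address = re.findall(r"\S+@\S+", line)

# Parentheses mark the part to extract; findall returns only that part.
user = re.findall(r"From: (\S+)@", line)

print(greedy, non_greedy, address, user)
```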

In R: Medical example

Summary

Summary: what did we learn?

  • Regular expressions play a surprisingly large role

    • Sophisticated sequences of regular expressions are often the first model for any text processing task
  • For many hard tasks, we use machine learning classifiers

    • But regular expressions are used as features in the classifiers

    • Can be very useful in capturing generalizations

  • Regular expressions are a cryptic but powerful language for matching strings and extracting elements from those strings

  • Regular expressions have special characters that indicate intent

Practical 1